library(quanteda)
At the moment, two quanteda objects, dfm and kwic, have custom plot methods: a dfm is plotted as a wordcloud, and a kwic as a lexical dispersion plot. Other plots of interest can be made with standard R techniques.
Plotting a dfm object will create a wordcloud using the wordcloud package.
inaugDfm <- dfm(inaugCorpus)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 57 documents
## ... indexing features: 9,214 feature types
## ... created a 57 x 9215 sparse dfm
## ... complete.
## Elapsed time: 0.391 seconds.
suppressWarnings( # Some words will not fit on a plot this size, so suppress those warnings
plot(inaugDfm)
)
You can also plot a “comparison cloud”, but only with eight or fewer documents:
firstDfm <- dfm(texts(inaugCorpus)[1:8])
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 8 documents
## ... indexing features: 2,668 feature types
## ... created a 8 x 2669 sparse dfm
## ... complete.
## Elapsed time: 0.029 seconds.
suppressWarnings( # Some words will not fit on a plot this size, so suppress those warnings
plot(firstDfm, comparison = TRUE)
)
plot will pass additional arguments through to the underlying call to wordcloud, so you can, for example, set the colors there:
suppressWarnings( # Some words will not fit on a plot this size, so suppress those warnings
plot(inaugDfm,
     colors = c('red', 'yellow', 'pink', 'green', 'purple', 'orange', 'blue'))
)
Plotting a kwic object produces a lexical dispersion plot, which lets us visualize the occurrences of particular terms throughout the text.
# using words from tokenized corpus for dispersion
plot(kwic(inaugCorpus, "american"))
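Before plotting, it can also help to inspect the kwic object itself, since it lists each match with its surrounding context. A minimal sketch (the window argument controls how many tokens of context are kept on each side of a match):

```r
# Inspect the matches that the dispersion plot will draw; window = 3
# keeps three tokens of context on either side of each hit
head(kwic(inaugCorpus, "american", window = 3))
```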
You can also pass multiple kwic objects to plot to compare the dispersion of different terms:
plot(
kwic(inaugCorpus, "american"),
kwic(inaugCorpus, "people"),
kwic(inaugCorpus, "communist")
)
If you’re only plotting a single document, but with multiple keywords, then the keywords are displayed one below the other rather than side-by-side.
inaugAdams <- corpus(inaugTexts[[3]])
plot(
    kwic(inaugAdams, "america"),
    kwic(inaugAdams, "citizen"),
    kwic(inaugAdams, "heart")
)
You might also have noticed that the x-axis scale is the absolute token index for single texts, and the relative token index when multiple texts are being compared. If you prefer, you can explicitly specify that you want an absolute scale:
plot(
kwic(inaugCorpus, "american"),
kwic(inaugCorpus, "people"),
kwic(inaugCorpus, "communist"),
scale='absolute'
)
In this case, the texts may not have the same length, and the tokens that don’t exist in a particular text are shaded in grey.
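Conversely, the default relative scale can also be requested explicitly, which makes a saved script unambiguous about which scale it produces (a minimal sketch, assuming the same scale argument accepts 'relative'):

```r
plot(
    kwic(inaugCorpus, "american"),
    kwic(inaugCorpus, "people"),
    scale = 'relative'  # the default when comparing multiple documents
)
```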
The object returned is a ggplot object, which can be modified with standard ggplot2 functions:
library(ggplot2)
theme_set(theme_bw())
g <- plot(
kwic(inaugCorpus, "american"),
kwic(inaugCorpus, "people"),
kwic(inaugCorpus, "communist")
)
g + aes(color = keyword) + scale_color_manual(values = c('blue', 'red', 'green'))
You can plot the frequency of the top features in a text using topfeatures.
inaugFeatures <- topfeatures(inaugDfm, 100)
# Create a data.frame for ggplot
topDf <- data.frame(
    term = names(inaugFeatures),
    frequency = unname(inaugFeatures)
)
# Sort by reverse frequency order
topDf$term <- with(topDf, reorder(term, -frequency))
ggplot(topDf) + geom_point(aes(x=term, y=frequency)) +
theme(axis.text.x=element_text(angle=90, hjust=1))
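The same data frame also works as a bar chart; this sketch only swaps the geom (stat = 'identity' tells ggplot2 to use the frequency values as given rather than counting rows):

```r
ggplot(topDf) +
    geom_bar(aes(x = term, y = frequency), stat = 'identity') +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
```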
If you wanted to compare the frequency of a single term across different texts, you could plot the dfm matrix like this:
americanFreq <- data.frame(
    document = rownames(inaugDfm[, 'american']),
    frequency = unname(as.matrix(inaugDfm[, 'american']))
)
ggplot(americanFreq) + geom_point(aes(x=document,y=frequency)) +
theme(axis.text.x=element_text(angle=90, hjust=1))
The above plots are raw frequency plots. For relative frequency plots (word count divided by the length of the document), we can weight the document-feature matrix. To obtain the expected word frequency per 100 words, we multiply by 100. To get a feel for what the resulting weighted dfm looks like, you can inspect it with the head function, which prints the first few rows and columns.
relDfm <- weight(inaugDfm, type='relFreq') * 100
head(relDfm)
## Document-feature matrix of: 57 documents, 9,215 features.
## (showing first 6 documents and first 6 features)
## features
## docs fellow-citizens of the senate and
## 1789-Washington 0.06993007 4.965035 8.111888 0.06993007 3.356643
## 1793-Washington 0.00000000 8.148148 9.629630 0.00000000 1.481481
## 1797-Adams 0.12942192 6.039689 7.031924 0.04314064 5.608283
## 1801-Jefferson 0.11587486 6.025492 7.531866 0.00000000 4.692932
## 1805-Jefferson 0.00000000 4.662973 6.602031 0.00000000 4.293629
## 1809-Madison 0.08510638 5.872340 8.851064 0.00000000 3.659574
## features
## docs house
## 1789-Washington 0.1398601
## 1793-Washington 0.0000000
## 1797-Adams 0.0000000
## 1801-Jefferson 0.0000000
## 1805-Jefferson 0.0000000
## 1809-Madison 0.0000000
relFreq <- data.frame(
    document = rownames(relDfm[, 'american']),
    frequency = unname(as.matrix(relDfm[, 'american']))
)
ggplot(relFreq) + geom_point(aes(x=document,y=frequency)) +
theme(axis.text.x=element_text(angle=90, hjust=1))